This topic modeling was carried out on reviews extracted from the Yelp database, which contains 6,990,279 reviews, of which 156,067 are restaurant reviews. Among these restaurant reviews, 32,593 are negative reviews, i.e. 1- or 2-star reviews.
We will carry out the topic modeling on a sample of 5,000 restaurant reviews drawn from the 32,593 negative reviews filtered out of the original Yelp database.
In order to analyze the reviews for topics, or in other words perform topic modeling, we need to preprocess the text of the reviews. The preprocessing chosen for this project produces the following result:
| | text | clean_text |
|---|---|---|
| 0 | We went here for brunch on a Sunday with a lar... | [brunch, sunday, party, celebrate, family, tim... |
| 1 | Slackssss once you get trapped in line you wil... | [line, mile] |
| 2 | Must have been seated by the managers nephew. ... | [manager, personality, dish, hole, want, stran... |
| 3 | Sub-par. Go to Zaik instead. | [zaik] |
| 4 | A co-worker told me about this new opening spo... | [worker, spot, milk, check, california, hype, ... |
| ... | ... | ... |
| 4995 | Very disappointing. Customer service was poor.... | [customer, service, attention, waiter, bill, k... |
| 4996 | I used to go her ln highschool with my girlfri... | [highschool, taco, taco, chicken, taco, burrit... |
| 4997 | WARNING!!! DO NOT ORDER FROM THIS PLACE!!!! I... | [order, place, hotel, room, yelp, star, review... |
| 4998 | If you like eggs in a carton this place is for... | [carton, place, toast, side, mystery, hubby, o... |
| 4999 | The 2 is for the food. Service is 1 at best. ... | [service, order, water, order, chip, salsa, ch... |
5000 rows × 2 columns
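The cleaning step that produces the `clean_text` column can be sketched as follows. This is a hypothetical minimal version (simple regex tokenization plus a small hand-made stop-word list); judging by the output above, the real pipeline also lemmatizes and keeps only nouns:

```python
import re

# Hypothetical minimal stop-word list for illustration; a real pipeline would
# use a full list (e.g. from NLTK or spaCy) plus lemmatization and POS filtering.
STOP_WORDS = {
    "we", "went", "here", "for", "on", "a", "with", "to", "the", "in", "my",
    "but", "there", "was", "no", "one", "else", "and", "of", "is", "it",
}

def clean_text(review):
    # Lowercase, keep alphabetic tokens only, drop stop words and short tokens.
    tokens = re.findall(r"[a-z]+", review.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]

print(clean_text("We went here for brunch on a Sunday with a larger party."))
# ['brunch', 'sunday', 'larger', 'party']
```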
The most frequent words are mostly stop words. These words are not at all helpful to us and show the necessity of preprocessing the text before modeling the topics.
Number of tokens: 699680, Number of unique tokens: 20668 ['We', 'went', 'here', 'for', 'brunch', 'on', 'a', 'Sunday', 'with', 'a', 'larger', 'party', 'to', 'celebrate', 'a', 'few', 'birthdays', 'in', 'my', 'family', ',', 'but', 'there', 'was', 'no', 'one', 'else', 'in', 'the', 'restaurant']
There is a clear improvement in our corpus since we now only have restaurant-related words which can be used by our model to detect the topics.
Number of tokens: 98521, Number of unique tokens: 6939 ['brunch' 'sunday' 'party' 'celebrate' 'family' 'time' 'reservation' 'time' 'cake' 'dollar' 'person' 'cake' 'half' 'party' 'side' 'everyone' 'plate' 'dish' 'salsa' 'mine' 'meal' 'waitress' 'chef' 'time' 'meal' 'waitress' 'someone' 'meal' 'everyone' 'meal']
The Latent Dirichlet Allocation (LDA) model will allow us to detect topics in our reviews: sets of words that cluster together and have a higher probability of appearing together in a review.
At first we will use a smaller sample of 1,000 reviews in order to determine the ideal number of topics, after which we will give our final model 5,000 reviews to obtain the most precise topics.
In order to determine the ideal number of topics, we check the coherence score of our model. We have the option of using u_mass coherence, which is computationally faster but less accurate, or c_v, which is slower but more accurate. I decided to use a sample of 1,000 reviews and calculate c_v, which explains the apparently lower-than-expected score; with a larger sample or the whole dataset of reviews, the c_v score would likely be higher.
Now that we have decided on 4 topics, the area of the graph where we see a clear elbow, we will use our final LDA model with 5000 reviews.
Topic 0: 0.039*"pizza" + 0.028*"chicken" + 0.025*"menu" + 0.023*"sauce" + 0.021*"taste" + 0.020*"meal" + 0.019*"price" + 0.019*"flavor" + 0.018*"meat" + 0.017*"dinner"
Topic 1: 0.110*"place" + 0.051*"restaurant" + 0.031*"thing" + 0.024*"review" + 0.022*"star" + 0.020*"year" + 0.019*"something" + 0.019*"night" + 0.018*"anything" + 0.015*"everything"
Topic 2: 0.105*"time" + 0.101*"order" + 0.079*"service" + 0.029*"experience" + 0.027*"location" + 0.027*"staff" + 0.022*"nothing" + 0.018*"drink" + 0.018*"server" + 0.018*"wait"
Topic 3: 0.041*"customer" + 0.039*"minute" + 0.029*"manager" + 0.024*"hour" + 0.017*"table" + 0.017*"business" + 0.016*"money" + 0.016*"employee" + 0.015*"someone" + 0.015*"room"
We can see four clear topics emerge: food quality and the menu (topic 0), the restaurant or venue itself (topic 1), service and ordering delays (topic 2), and staff and management issues (topic 3).
The data used for the automatic image labeling is the Yelp database, which contains 200,100 photos, each with one of five potential labels: inside, outside, drink, food, or menu.
In order to represent each label equally, we will use 100 images per label in our sample (200 per label in the Jupyter notebooks), so we will be working on 500 images in total.
The objective of this exercise is to remove the original labels and to choose the most accurate method of automatically labeling the images.
Two methods have been tested to choose the best-performing one: feature extraction with ORB followed by KMeans clustering, and feature extraction with a pretrained CNN (VGG16) followed by KMeans clustering.
Oriented FAST and Rotated BRIEF, or ORB for short, is an open-source alternative to similar algorithms such as SIFT and SURF, which are patented. It uses FAST to detect the image keypoints and then computes the BRIEF descriptors. (More information here: https://docs.opencv.org/3.4/d1/d89/tutorial_py_orb.html)
Before being able to detect descriptors in an image, we need to preprocess our images (or equalize their histograms, in our case). Let's take an image from our sample images and preprocess it:
We see that the histogram has been equalized. Let's now use ORB to compute 100 features and show them on the image:
Each keypoint has an associated descriptor, and these descriptors are the data we will ask KMeans to analyze and separate into 5 clusters (our 5 labels).
Dimension of descriptors: 500 descriptors of length 2000
First descriptor: [ 1 249 27 ... 35 28 117]
We will then use PCA to reduce the dimension of our descriptors to explain 90% of the variance.
Dimension after PCA reduction: (500, 352)
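A sketch of this reduction with scikit-learn, where passing a float to `n_components` keeps just enough components to reach that fraction of explained variance (random data stands in for the real descriptor matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random matrix standing in for the 500 stacked descriptor vectors (hypothetical data).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2000))

# n_components=0.90 keeps the smallest number of components whose cumulative
# explained variance reaches 90%.
pca = PCA(n_components=0.90, random_state=42)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
```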
And finally we will use KMeans to cluster our descriptors into 5 clusters (representing food, drink, outside, inside, menu).
array([0, 2, 1, 2, 2, 3, 2, 0, 3, 4, 4, 1, 4, 1, 4, 0, 4, 0, 3, 0, 3, 2,
1, 4, 3, 3, 3, 2, 3, 3, 2, 2, 4, 3, 4, 3, 0, 1, 1, 2, 4, 3, 3, 3,
0, 3, 2, 2, 3, 0, 0, 1, 2, 3, 3, 0, 2, 4, 3, 0, 1, 1, 1, 0, 3, 3,
4, 3, 3, 3, 3, 4, 1, 4, 2, 3, 1, 3, 2, 0, 3, 4, 3, 0, 4, 3, 0, 2,
0, 4, 4, 0, 0, 3, 0, 4, 2, 3, 2, 3, 2, 0, 2, 3, 1, 2, 0, 4, 1, 0,
0, 0, 2, 3, 4, 4, 1, 2, 3, 3, 3, 1, 0, 3, 3, 3, 4, 3, 3, 4, 3, 1,
3, 0, 4, 3, 0, 3, 2, 3, 0, 0, 0, 2, 2, 3, 0, 2, 2, 1, 1, 3, 3, 0,
3, 1, 4, 0, 3, 4, 4, 1, 3, 2, 2, 2, 4, 4, 1, 3, 4, 4, 2, 4, 1, 4,
0, 4, 0, 0, 1, 4, 0, 3, 3, 2, 3, 0, 4, 4, 2, 4, 2, 2, 4, 4, 3, 0,
1, 2, 0, 4, 1, 2, 0, 3, 3, 4, 4, 1, 2, 3, 1, 3, 4, 4, 2, 2, 4, 0,
0, 2, 3, 1, 4, 0, 4, 4, 1, 4, 4, 0, 0, 4, 4, 3, 0, 0, 2, 0, 0, 2,
4, 0, 0, 0, 0, 3, 2, 3, 4, 2, 3, 4, 2, 2, 2, 2, 3, 0, 2, 2, 3, 1,
4, 4, 0, 3, 0, 4, 4, 4, 4, 1, 4, 4, 3, 2, 4, 0, 0, 3, 3, 2, 2, 3,
0, 4, 0, 4, 4, 0, 0, 2, 0, 0, 4, 2, 0, 3, 2, 0, 4, 3, 3, 3, 2, 0,
0, 3, 2, 0, 2, 4, 1, 0, 1, 3, 0, 0, 3, 2, 3, 4, 3, 2, 0, 1, 3, 4,
3, 4, 3, 2, 3, 4, 2, 4, 0, 0, 2, 4, 1, 0, 0, 2, 4, 4, 4, 2, 4, 3,
3, 3, 4, 4, 0, 3, 0, 0, 3, 1, 4, 4, 4, 3, 4, 3, 4, 4, 1, 3, 4, 3,
1, 0, 1, 0, 3, 4, 4, 0, 4, 4, 2, 0, 0, 0, 4, 3, 4, 4, 0, 0, 4, 1,
4, 1, 0, 4, 4, 3, 0, 4, 0, 3, 0, 0, 1, 2, 2, 3, 0, 4, 2, 4, 4, 2,
0, 4, 2, 4, 0, 4, 4, 0, 2, 0, 0, 4, 3, 1, 4, 2, 0, 4, 4, 0, 3, 2,
3, 0, 0, 2, 2, 1, 0, 4, 3, 0, 4, 2, 1, 4, 2, 0, 2, 2, 3, 0, 3, 4,
0, 2, 2, 2, 0, 1, 1, 4, 1, 3, 2, 0, 0, 2, 2, 2, 3, 4, 2, 4, 3, 3,
0, 2, 0, 2, 3, 4, 4, 4, 4, 1, 3, 4, 0, 4, 0, 1])
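The clustering step itself is a direct scikit-learn call; a sketch on random stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Random matrix standing in for the PCA-reduced descriptors (hypothetical data).
rng = np.random.default_rng(0)
X_reduced = rng.normal(size=(500, 352))

# One cluster per potential label: food, drink, outside, inside, menu.
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_reduced)

print(clusters[:10])
```

Note that the cluster ids (0 to 4) are arbitrary; nothing ties cluster 0 to any particular label, which is why we compare them against the true labels next.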
Cross-tabulation of the true labels against the KMeans clusters:

| true label | cluster 0 | cluster 1 | cluster 2 | cluster 3 | cluster 4 |
|---|---|---|---|---|---|
| drink | 27 | 7 | 20 | 17 | 29 |
| food | 24 | 10 | 12 | 23 | 31 |
| inside | 19 | 12 | 18 | 33 | 18 |
| menu | 26 | 9 | 24 | 15 | 26 |
| outside | 20 | 13 | 19 | 26 | 22 |
Comparing the true labels with the KMeans labels, we notice no clear pattern: we cannot say, for example, that KMeans cluster 1 is most often assigned to photos whose true label is "food".
Let's visualize the data with t-SNE in two dimensions as a last attempt to find associations between the true and predicted labels.
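The 2D projection behind such a plot can be sketched with scikit-learn's t-SNE (random data stands in for the reduced descriptors; the perplexity value is an assumption):

```python
import numpy as np
from sklearn.manifold import TSNE

# Random matrix standing in for the PCA-reduced descriptors (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 352))

# Project to 2 dimensions for visualization; perplexity=30 is a common default.
coords = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

print(coords.shape)  # (500, 2)
```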
ARI : 0.015019048620507871
The Adjusted Rand Index (ARI) of the true labels vs. the predicted labels is about 0.015, barely above the 0 expected for a random labeling. This low score means our clustering on ORB descriptors did not work as well as we hoped. In the next part, we will try something similar using a deep learning model and hopefully get better results.
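For reference, ARI is available in scikit-learn; this small sketch also shows why it suits our situation: it measures agreement between two groupings while ignoring the arbitrary numbering of the clusters.

```python
from sklearn.metrics import adjusted_rand_score

true = [0, 0, 1, 1, 2, 2]
pred_perfect = [2, 2, 0, 0, 1, 1]  # same grouping, different cluster ids
pred_random = [0, 1, 2, 0, 1, 2]   # unrelated grouping

print(adjusted_rand_score(true, pred_perfect))  # 1.0
print(adjusted_rand_score(true, pred_random))   # near (or below) 0
```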
A Convolutional Neural Network (CNN or ConvNet) is a type of neural network typically used in image recognition, image classification, object detection, face recognition, etc. We will be using a neural network which has convolutional layers for feature extraction and pooling layers for reducing the dimensions of our feature maps.
We will not be using the fully connected layers or the softmax because we will not be training the model at this point; instead we will use the model pretrained on the labeled ImageNet database. In other words, we will be carrying out transfer learning: reusing a previously trained model on a new problem.
Here is our model, where as expected we do not have the last fully connected and softmax layers:
Model: "vgg16"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 224, 224, 3)] 0
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
=================================================================
Total params: 14,714,688
Trainable params: 0
Non-trainable params: 14,714,688
_________________________________________________________________
The shape of the features extracted from the sample image is (7, 7, 512), corresponding to the last layer of our model, i.e. the last max-pooling layer in the VGG16 diagram above.
1/1 [==============================] - 4s 4s/step (1, 7, 7, 512)
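This feature-extraction step can be sketched with Keras. Note that `weights=None` is used here only to avoid downloading the ImageNet weights; the actual pipeline loads `weights="imagenet"`:

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# VGG16 without its fully connected / softmax head (include_top=False).
# weights=None avoids the ImageNet download; the real run uses weights="imagenet".
model = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))

# One random image standing in for a resized Yelp photo (hypothetical data).
img = preprocess_input(np.random.rand(1, 224, 224, 3) * 255.0)
features = model.predict(img, verbose=0)

print(features.shape)  # (1, 7, 7, 512)
```

The (1, 7, 7, 512) output per image would then be flattened and fed to PCA and KMeans, as in the ORB pipeline.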
ARI : 0.5632777907520027
We achieve an ARI score of 56%, which is quite good considering the model wasn't trained on our data and we only gave it 500 photos to work with. Thanks to the t-SNE plots, we can see which true labels correspond to the KMeans-predicted labels. We are going to map the clusters to their label names in order to have a clear classification of our results.
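One way to derive such a mapping (a sketch, not necessarily the method used here) is the Hungarian algorithm on the true-label/cluster contingency table, which picks the cluster-to-label assignment maximizing total agreement. The counts below are hypothetical:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical contingency table: rows = true labels, columns = KMeans clusters.
labels = ["drink", "food", "inside", "menu", "outside"]
counts = np.array([
    [ 5, 80,  5,  5,  5],
    [70,  5, 10, 10,  5],
    [ 5,  5,  5, 75, 10],
    [ 5,  5, 80,  5,  5],
    [10,  5,  5,  5, 75],
])

# Negate because linear_sum_assignment minimizes cost; we want max agreement.
row, col = linear_sum_assignment(-counts)
mapping = {int(c): labels[r] for r, c in zip(row, col)}
print(mapping)  # {1: 'drink', 0: 'food', 3: 'inside', 2: 'menu', 4: 'outside'}
```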
Classification Report

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| drink | 0.80 | 0.84 | 0.82 | 100 |
| food | 0.95 | 0.81 | 0.88 | 100 |
| inside | 0.73 | 0.64 | 0.68 | 100 |
| menu | 0.85 | 0.92 | 0.88 | 100 |
| outside | 0.63 | 0.72 | 0.67 | 100 |
| accuracy | | | 0.79 | 500 |
| macro avg | 0.79 | 0.79 | 0.79 | 500 |
| weighted avg | 0.79 | 0.79 | 0.79 | 500 |
Confusion Matrix
We have an overall accuracy of 79%. The label predicted with the most precision is "food", whilst the least precise is "outside". The labels most often confused are "inside" and "outside", as well as "food" and "drink".
"Inside" predicted as "Outside": 26 occurences
The reason for these mismatchings are probably the lighting in the photos. The second and fourth photo have a colder lighting that is similar to natural light. The first and third photo also have framed pictures which might be misinterpreted by the model, especially the third photo which has a framed picture of the sky with the sun.
"Outside" predicted as "Inside": 22 occurences
The first photo is probably mislabeled from the beginning, as it is truly a photo of an interior. The second photo is under some sort of roof which might be misinterpreted as a ceiling. The fourth photo also has most of the top area covered with an umbrella and a roof which might be misinterpreted as a ceiling.
"Food" predicted as "Drink": 12 occurences
The first and third photo both have a round shaped bowl of sauce which might be misinterpreted as a drink. The second has a similar plastic transparent cup and might be misinterpreted for resembling a small cup. The last photo might have been mislabeled due to the presence of a lime, which is often found on drinks.
"Drink" predicted as "Outside": 8 occurences
The first photo has very natural lighting which might be causing the mislabeling, and the second photo seems to have been taken outside. The items on the photo are not very identifiable so it isn't surprising that the model is having some trouble.
In summary, the features extracted with the transfer learning model VGG16 performed much better than the features extracted through ORB. If we were to train the VGG16 model in a supervised fashion and test it on not just 500 photos but thousands, we could likely reach a very satisfactory accuracy.